Advantages of ggplot2
That said, there are some things you cannot (or should not) do with ggplot2:
Before we dive into the specifics, it may be helpful to have the ggplot2 cheat sheet handy.
The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ──────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
As a simple example, we’ll use the iris dataset built into R, which contains sepal and petal measurements (length and width) for 50 flowers from each of 3 species of iris.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
The first step in creating a ggplot2 graph is to define a ggplot object. We do this with the function ggplot, which initializes the graph. If we read the help file for this function, we see that the first argument specifies the data associated with this object:
ggplot(data = iris)
We can also pipe the data into ggplot with the %>% operator, which dplyr provides (it is loaded as part of the tidyverse). This line of code is equivalent to the one above:
iris %>% ggplot()
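The pipe becomes more useful when the data needs some preparation first. The following is a minimal sketch, assuming we only want two of the three species; the object name p_pipe is made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Filter the data first, then hand the result to ggplot() as its data argument
p_pipe <- iris %>%
  filter(Species != "setosa") %>%
  ggplot()

# The plot object carries the filtered data: 2 species x 50 flowers
nrow(p_pipe$data)  # 100
```

Note that %>% connects data-preparation steps, while layers are still added to the plot with +.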
In ggplot we create graphs by adding layers. Layers can define geometries, compute summary statistics, define what scales to use, or even change styles. To add layers, we use the symbol +.
In general, a line of code will look like this:
ggplot(DATA) + LAYER 1 + LAYER 2 + … + LAYER n
Usually, the first added layer defines the geometry.
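As a rough illustration of this pattern, the sketch below builds a plot one layer at a time and counts the layers as it goes. The geometries and the aes() mapping used here are explained in the text that follows; the variable name p is only for illustration.

```r
library(ggplot2)

# Start with the data and a mapping only; no layers yet
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
length(p$layers)  # 0

# Add a geometry layer, then a second layer drawn on top of it
p <- p + geom_point()
p <- p + geom_smooth(method = "lm", formula = y ~ x)
length(p$layers)  # 2

# Display the plot
p
```

Because each + returns a new plot object, intermediate plots can be stored and reused with different layers.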
Geometric objects are the actual marks we put on a plot. Examples include:
Let’s say we want to make a scatterplot. Looking at the cheat sheet, we notice that the required geometry is geom_point. As a general rule, most function names follow the pattern of geom and the name of the geometry connected by an underscore.
These concepts will become clearer once we understand mappings.
In ggplot, an aesthetic (shortened to aes) means “something you can see”, and aes will be one of the functions you use most. It connects the data with what we see on the graph, and this connection is referred to as the aesthetic mapping.
Each aes statement usually includes some optional arguments such as:
Below is a basic scatterplot of sepal length versus width.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
We can also add some arguments to the geometry, like size or alpha:
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(size = 3, alpha=0.4)
Or we can color the points by species:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point()
In the above example, we mapped species to the color aesthetic, but we could have mapped species to the shape aesthetic in the same way. In this case, the shape of each point would reveal its species affiliation.
ggplot(iris) + geom_point(aes(Sepal.Length, Sepal.Width, shape = Species))
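The examples above use two different mechanisms: arguments such as size and alpha are set to fixed values outside aes(), while color and shape are mapped to a variable inside aes(). The sketch below puts the two side by side; the object names p_mapped and p_fixed are made up for illustration.

```r
library(ggplot2)

# Mapped: the color depends on the data, so it goes inside aes()
# and ggplot2 adds a legend for Species
p_mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species))

# Set: every point gets the same fixed color, so it goes outside aes()
# and no legend is drawn
p_fixed <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "steelblue")

p_mapped
p_fixed
```

A common mistake is writing aes(color = "steelblue"), which treats the string as a one-level variable rather than as a color.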
ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 8, color = "black")
ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black")
ggplot(data = mpg, aes(x = fl, fill = class)) +
  geom_bar()
ggplot(economics, aes(date, uempmed)) + geom_line()
ggplot(iris, aes(Species, Sepal.Length)) + geom_bar(stat = "identity")
ggplot(faithful, aes(waiting)) + geom_density()
ggplot(faithful, aes(waiting)) + geom_density(fill = "blue", alpha = .1)
geom_smooth adds a smoothed conditional mean to the plot. The method argument selects how the curve is fit, and different methods suit different settings (linear regression, generalized linear models, GAMs, loess, etc.).
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
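To compare methods, two smoothers can be drawn on the same plot. This is a sketch using the mpg data rather than iris; the colors and the object name p_smooth are arbitrary choices for illustration.

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.4) +
  # Straight-line fit from a linear model; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  # Local regression, which can follow curvature in the data
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red")

p_smooth
```

Passing formula explicitly avoids the message geom_smooth prints when it has to pick a default.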
Statistical transformations summarize the data before it is drawn. All geoms have associated default stats, and vice versa (e.g. binning for a histogram or fitting a linear model).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
ggplot(birthwt, aes(factor(race), bwt)) + geom_boxplot()
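One way to see the stat behind a geom is to look at the data it computed. The sketch below uses layer_data() from ggplot2 on a histogram of the faithful data; the names p_hist and bins are made up for illustration.

```r
library(ggplot2)

p_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 8)

# layer_data() returns the data frame the stat computed for a given layer
bins <- layer_data(p_hist, 1)
head(bins[, c("xmin", "xmax", "count")])

# Every observation lands in exactly one bin
sum(bins$count) == nrow(faithful)  # TRUE
```

The histogram is drawn from these binned counts, not from the raw waiting times.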
ggplot has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the dataset.
Subsetting data to make lattice plots can be really powerful!
# single column, multiple rows
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point() +
  facet_grid(Species ~ .)
# single row, multiple columns
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point() +
  facet_grid(. ~ Species)
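The formula in facet_grid can also name a variable on each side, giving a full grid of panels. As a sketch, assuming the mpg data and its drv and cyl columns (the object name p_grid is illustrative):

```r
library(ggplot2)

# Rows are split by drive type (drv), columns by number of cylinders (cyl)
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_grid
```

Combinations that do not occur in the data show up as empty panels.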
Using the mpg dataset in R, we can plot the highway miles per gallon (hwy) a car gets against its engine size (displ), with one panel per car manufacturer:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ manufacturer)
We can split the faceted plot further by mapping class to color within each panel:
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_wrap(~ manufacturer)
Themes are a great way to define custom plots. ggplot is highly customizable!
# install.packages("ggthemes")
library(ggthemes)
# Then add one of many themes to your plot
# theme_stata(), theme_excel(), theme_wsj(), theme_solarized()
# see ?ggthemes for more info
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 1.2, shape = 16) +
  facet_wrap( ~ Species) +
  theme_economist()
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 1.2, shape = 16) +
  facet_wrap( ~ Species) +
  theme_bw()
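Beyond complete themes, individual elements can be adjusted with theme(). The following is a small sketch that starts from theme_bw() and changes two settings; the particular settings and the name p_theme are arbitrary choices for illustration.

```r
library(ggplot2)

p_theme <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  theme_bw() +
  theme(
    legend.position = "bottom",         # move the legend under the plot
    panel.grid.minor = element_blank()  # drop the minor grid lines
  )

p_theme
```

The order matters here: adding a complete theme such as theme_bw() after theme() would overwrite these adjustments.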
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  # displ is on the x axis and hwy on the y axis, so label them accordingly
  labs(title = "Relationship between engine size and miles per gallon (mpg)",
       x = "Engine displacement (liters)",
       y = "Highway MPG") +
  theme_bw()
## `geom_smooth()` using method = 'loess'